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Abstract 

The traditional maximum likelihood estimator (MLE) is often of limited use in 
complex high-dimensional data due to the intractability of the underlying likelihood 
function. Maximum composite likelihood estimation (McLE) avoids full likelihood 
specification by combining a number of partial likelihood objects depending on small 
data subsets, thus enabling inference for complex data. A fundamental difficulty in 
making the McLE approach practicable is the selection from numerous candidate like¬ 
lihood objects for constructing the composite likelihood function. In this paper, we 
propose a flexible Gibbs sampling scheme for optimal selection of sub-likelihood com¬ 
ponents. The sampled composite likelihood functions are shown to converge to the one 
maximally informative on the unknown parameters in equilibrium, since sub-likelihood 
objects are chosen with probability depending on the variance of the corresponding 
McLE. A penalized version of our method generates sparse likelihoods with a relatively 
small number of components when the data complexity is intense. Our algorithms are 
illustrated through numerical examples on simulated data as well as real genotype SNP 
data from a case-control study. 

Keywords: Composite likelihood estimation, Gibbs sampling. Jackknife, Efficiency, Model 
Selection 
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1. INTRODUCTION 


While maximum likelihood estimation plays a central role in statistical inference, today its 
application is challenged in a number of helds where modern technologies allow scientists 
to collect data in unprecedented size and complexity. These helds include genetics, biology, 
environmental research, meteorology and physics, to name a few. Two main issues arise when 
attempting to apply traditional maximum likelihood to high-dimensional or complex data. 
The hrst concerns modelling and model selection, since high-dimensional data typically imply 
complex models for which the full likelihood function is difficult, or impossible, to specify. 
The second relates to computing, when the full likelihood function is available but it is just 
too complex to be evaluated. 

The above limitations of traditional maximum likelihood have motivated the development 
of composite likelihood methods, which avoid the issues from full likelihood maximization by 
combining a set of low-dimensional likelihood objects. Besag (1974) was an early proponent 
of composite likelihood estimation in the context of data with spatial dependence; Lindsay 
(1988) developed composite likelihood inference in its generality and systematically studied 
its properties. Over the years, composite likelihood methods have proved useful in a range 
of complex applications, including models for geostatistics, spatial extremes and statistical 
genetics. See Varin, Reid and Firth (2011) for a comprehensive survey of composite likelihood 
theory and applications. 

Let X = (Xi,..., Xd)"^ be a d X 1 random vector with pdf (or pmf) f{x\6Q), where Oq G 
© C ]RP, p > 1, is the unknown parameter. From independent observations ..., 
one typically computes the maximum likelihood estimator (MLE), Omie, by maximizing the 
likelihood function Lm/e (9) = nr.,/(v<->ie). Now suppose that complications in the d- 
dimensional pdf (pmf) f{x\6) make it difficnlt to specify (or compute) Lmie{^) as the data 
dimension d grows, but it is relatively easy to specify (or compute) one-, two-,..., dimensional 
distributions up to some order for some functions of Xi,..., Xd- One can then follow Lind¬ 
say (1988) to estimate 6 by the maximnm composite likelihood estimator (McLE), which 
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maximizes the composite likelihood function: 


M 

L,i{6) = J] L^d), (1) 

m=l 

where each Lm{0) is a user-specihed partial likelihood (or sub-likelihood) depending on 
marginal or conditional events on variables. For example, Lm can be dehned using a marginal 
event {xm} (marginal composite likelihood), pairs of variables such as {xi^X 2 } (pairwise 
likelihood), or conditional events like {xi,a: 2 }|{a;i} (conditional composite likelihood). For 
simplicity, we assume that 0 is common to all sub-likelihood components, so that any fac¬ 
torization based on a subset of {LmiO), m = 1,..., M} yields a valid objective function. 

Although the composite likelihood approach provides a flexible framework with a sound 
theory for making inference about 6 in situations involving multivariate data, there exist at 
least two challenges hindering the efficiency improvement and feasible computing of McLE in 
applications. The hrst challenge lies with selecting the right sub-likelihood components for 
constructing an informative composite likelihood function. The current practice of keeping 
all plausible factors in ([^ is not well justihed in terms of efficiency relative to MLE, since 
inclusion of redundant factors can deteriorate dramatically the variance of the corresponding 
composite likelihood estimator (e.g., see Cox and Reid (2004)). A better strategy would be 
to choose a subset of likelihood components which are maximally informative on 9q, and 
drop noisy or redundant components to the maximum extent. However, little work is found 
in the literature in regard to optimal selection of sub-likelihood components. The second 
challenge lies with the computational complexity involved in the maximization of Lci{0), 
which can go quickly out of reach as d (and M) increases. Particularly, note that computing 
Lci involves M x Nops{dci) operations, where Nops{dci) is the number of operations for each 
sub-likelihood component. The computational burden is exacerbated when © is relatively 
large and the different sub-likelihood factors LmiO) do not depend on distinct elements of 
0. One would like to see this computing complexity be alleviated to a manageable level 
by applying parsimonious likelihood composition in the presence of high (or ultra-high) 
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dimensional data. 

Motivated by the aforementioned challenges, in this paper we propose a new class of 
stochastic selection algorithms for optimal likelihood composition. Our method uses Gibbs 
sampler, a specihc Markov Chain Monte Carlo (MCMC) sampling scheme to generate in¬ 
formative - yet parsimonious - solutions. The resulting estimates will converge to the one 
maximally informative on the target parameter Oq as the underlying Markov chain reaches 
equilibrium. This is because sub-likelihood components generated in the algorithms are 
drawn according to probabilities determined by a criterion minimizing the McLE’s variance 
or its consistent approximation. Theory of unbiased estimating equations prescribes McLE’s 
asymptotic variance as an optimality criterion, i.e., the Oir-optimality criterion, see Heyde 
(1997, Ch. 2), but such an ideal objective has scarce practical utility due to the mathemati¬ 
cal complexity and computational cost involved in evaluating the numerous covariance terms 
implied by the asymptotic variance expression (Lindsay, Yi and Sun 2011). To address this 
issue, we replace the asymptotic variance by a rather inexpensive jackknife variance esti¬ 
mator computed by a one-step Newton-Raphson iteration. Such a replacement is shown to 
work very well based on our numerical study. 

Another advantage of our approach is that proper use of the Gibbs sampler can generate 
sparse composition rules, i.e., composite likelihoods involving a relatively small number of 
informative sub-likelihood components. Note that the model space implied by the set of all 
available sub-likelihood components can be large, even when d is moderate. For example, if 
d = 20, we have 2^ = 2 ^ 2 ) = 2 ^®° possible composition rules based on pair-wise likelihood 
components alone. To cope with such a high-dimensionality, we combine Gibbs sampling 
with a composite likelihood stochastic stability mechanism. Analogously to the stochastic 
stability selection proposed by Meinshausen and Biihlmann (2010) in the context of high¬ 
dimensional model selection, our approach selects a small but sufficient number of informative 
likelihood components through the control of the error rates of false discoveries. 

The paper is organized as follows: In Section we outline the main framework and basic 
concepts related to composite likelihood estimation. We also describe the OF-optimality 


4 


criterion and introduce its jackknife approximation. In Section we describe our core 
algorithm for simultaneous likelihood estimation and selection. In Section we discuss 
an extension of our algorithm by incorporating the ideas of model complexity penalty and 
stochastic stability selection. This leads to a second algorithm for parsimonious likelihoods 
composition for high-dimensional data. In Section we illustrate our methods through 
numerical examples involving simulated data and real genetic single nucleotide polymorphism 
(SNP) data from a breast cancer case-control study. Section [^concludes the paper with hnal 
remarks. 


2. SPARSE COMPOSITE LIKELIHOOD FUNCTIONS 
2.1 Binary Likelihood Composition 

Let {Ml,..., Am} be a set of marginal or conditional sample sub-spaces associated with prob¬ 
ability density functions (pdfs) fm{x G Am\0)- See Varin et ah (2011) for interpretation and 
examples of Am- Given independent d-dimensional observations ..., r\j f{x\Oo), 

00 e © C MP, p > 1, we dehne the composite log-likelihood function: 

M 

U6,u:) = ( 2 ) 

m=l 

where uj = (cji,... G f2 = (0,1}*^, and Pm{G) is the partial log-likelihood 

n 

im{e) = 5^1og/™(X« G Am\e). (3) 

i=l 

Each partial likelihood object (sub-likelihood) lm{G) is allowed to be selected or not, depend¬ 
ing on whether Um is 1 or 0. We aim to approximate the unknown complete log-likelihood 
function imie{G) = logLmie(0) by selecting a few - yet the most informative - sub-likelihood 
objects from the M available objects, where M is allowed to be larger than n and p. Given 
the composition rule n; G f2, the maximum composite likelihood estimator (McLE), denoted 
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by 0{uj), is defined by the solution of the following system of estimating equations: 

n n M 

O = ^C/«(0,u,):=^^a,„C/«(e), (4) 

i=l i=l m=l 

where C7«(0) = V 0 log/™(X« e Am\e). m = 1 ,..., M, i = 1 ,..., n, are unbiased partial 
scores functions. Under standard regularity conditions on the sub-likelihoods (Lindsay 1988), 
^/n{6{u) — 0 o(^)) -^p(O) ^o) with asymptotic variance given by the p x p matrix 

Vo(n;) = V(0o, = H(0o, ^)^'K(0 o, ^)H(0o, (5) 


where 


M 

H(6»,n;) = ^a;™H™(6/), K{e,uj) 

m=l 


Var 


■ M 

^ ^ ^mU 
jn=l 




Hm( 0 ) = Var{Um{G)) is the p x p sensitivity matrix for the mth component, and Um{G) = 

Ve\og fm{X e Am\0). 

The main approach followed here is to minimize, in some sense, the asymptotic variance, 
Vo(<-i;). To this end, theory of unbiased estimating equations suggests that both matrices 
H and K should be considered in order to achieve this goal (e.g., see Heyde (1997, Chapter 
2)). On one hand, note that H measures the covariance between the composite likelihood 
score U{6,u;)= Y.^=i^raUm{0) = log/^(X G Ajn\0) and the MLE score 

Umie{G) = V e^og f\X]6). To see this, differentiate both sides in E\U{6q,u:)] = 0 and ob¬ 
tain H( 0 o,^) = E \UmieiGQ)U{OQ, • This shows that adding sub-likelihood components 
is desirable, since it increases the covariance with the full likelihood. On the other hand, 
including too many correlated sub-likelihoods components inflates the variance through the 
covariance terms in K( 0 O 5 ^)- 


2.2 Fixed-Sample Optimality and its Jackknife Approximation 

The objective of minimizing the asymptotic variance is still undefined since \{6q,u;) in (|^ 
is a p X p positive semidefinite matrix. Therefore, we consider the following one-dimensional 
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objective function 


go{u}) = logdet{V(0o,^)} = logdet{K(0o, ^)} - 2logdet{H(0o, ^)}- (6) 


The minimizer, u>q, of the ideal objective ([^ corresponds to the Oir-optimal solution (hxed- 
sample optimality) (e.g., see Heyde (1997), Chapter 2). Clearly, such a program still lacks 
practical relevance, since (|^ depends on the unknown parameter Oq- Therefore, go(^^) should 
be replaced by some sample-based estimate, say ^o(<^)- 

One option is to use the following consistent estimates of H(0o,^ ) and K(0o,^) in ([^: 


M 


H(n;) = 


n 


EE' 

m=l 2=1 




(i), 


K(n;) = - 
n 


5^C/«(0,n;)C/«(0,n;) 


(7) 


2=1 


where the estimator 0 = 9(uj) is the McLE. Although this strategy works in simple models 
when n is relatively large and M is small, the estimator K(n;) is knowingly unstable when 
n is small compared to dim(©) (Varin et ah 2011). Another issue with this approach in 
high-dimensional datasets is that the number of operations required to compute K(a;) (or 
K(0,li;)) increases quadratically in 

To reduce the computational burden and avoid numerical instabilities, we estimate g^ by 
the following one-step jackknife criterion: 


g{u:) = logdet i ^ (^(^)^ 
I i=i 


0(n;)) 



( 8 ) 


where the pseudo-value is a composite likelihood estimator based on a sample 

without observation and 0[uij) = X]r=i Alternatively, one could use the 

delete-/c jackknife estimate where k > 1 observations at the time are deleted to compute 
the pseudo-values. The delete-/c version is computationally cheaper than delete-1 jackknife 
and therefore should be preferred when the sample size, n, is moderate or large. Other 
approaches - including bootstrap - should be considered depending on the model set up. 
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For example, block re-sampling techniques such as the block-bootstrap (see Hall, Horowitz 
and Jing (1995) and subsequent papers) are viable options for spatial data and time series. 

When the sub-likelihood scores are in closed form, the pseudo-values can be efficiently 
approximated by the following one-step Newton-Raphson iteration: 


M 


M 


= e + c/'pe) 


( 9 ) 


<771=1 




<m=l 




where 6 is any root-n consistent estimator of 6. Note that 6 does not need to coincide with 
the McLE, 0{lo), so a computationally cheap initial estimator - based only on a few small sub- 
likelihoods subset - may be considered. Remarkably, the number of operations required in the 
Newton-Raphson iteration (§ grows linearly in the number of sub-likelihood components, so 
that the one-step jackknife objective has computational complexity comparable to a single 
evaluation of the composite likelihood function (|^. Large sample properties of jackknife 
estimator of McLE’s asympotic variance can be derived under regularity conditions on Um 
and Hm analogous to those described in Shao (1992). Then, for the one-step jackknife 
using the root-n consistent starting point 6, we have n{g{(jj) — go{<^)) ^0, as n —)■ oo. 
uniformly on f2. Moreover, the estimator g{Lo) is asymptotically equivalent to the classic 
jackknife estimator. 


3. PARAMETER ESTIMATION AND LIKELIHOOD COMPOSITION 
3.1 Likelihood Composition via Gibbs Sampling 

When computing the McLE of 6, our objective is to hnd an optimal binary vector lo* to 
estimate ujq = argmin^g^^ where go(-) is the ideal objective function dehned in ([^. 

Typically, the population quantity go cannot be directly assessed, so we replace go with the 
sample-based jacknife estimate g described in Section and aim at Ending 


LO* = argmin g{u}). 
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This task, however, is computationally infeasible through enumerating the space if d is 
even moderately large. For example, for composite likelihoods defined based on all pairs of 
variables (Xs,Xt), 1 < s < t < d (pairwise likelihood), 17 contains 2 ^ 2 ) = 2 ^^° elements when 
d = 20. 

To overcome this enumeration complexity, we carry out a random search method based 
on Gibbs sampling. We regard the weight vector a; as a random vector following the joint 
probability mass function (pmf) 


Tlriuj) 


1 

W) 


exp{-Tg{uj)}, 


Cl? G 


( 10 ) 


where Z{t) = Xloiennormalizing constant. The above distribution 
depends on the tuning parameter r > 0, which controls the extent to which we emphasize 
larger probabilities (and reduce smaller probabilities) on 17 . Then oj* is also the mode of 
7rT-(a;), meaning that oj* will have the highest probability to appear, and will be more likely 
to appear earlier rather than later, if a random sequence of oj is to be generated from 7r^(a;). 

Therefore, estimating oj* can be readily done based on the random sequence generated 
from 7rT-(a;). But generating a random sample from Tiri^) directly is difficult because Tiri^) 
contains an intractable normalizing constant Z{t). Instead, we will generate a Markov 
chain using the product of all univariate conditional pmf’s with respect to as the 

transitional kernel. The stationary distribution of such a Markov chain can be proved to be 
(Robert and Casella 2004, Chapter 10). Hence, the part of this Markov chain after 
reaching equilibrium can be regarded as a random sample from Tirioj) for most purposes. The 
MCMC method just described is in fact the so-called Gibbs sampling method. The key factor 
for the Gibbs sampling to work is all the univariate conditional probability distributions of 
the target distribution can be relatively easily simulated. 

Let us write u) = (cui, • • ■ ,um); - ■ ■ ,^m 2 ), if mi < m 2 , and 

<^mi:m 2 = ^ Otherwise; and u>-m = ( 1 ^ 1 , • • ■ , ■ ■ ■ , ^m)- Then it is easy to see that 
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the conditional pmf of Um given oj-m, m = 1, ■ ■ ■ , M, is 


TTt (cJn 




exp{-r^(n;)} 


exp{-r^(n;i^‘)} + exp{-r^(n;[;'i')} 


OJr. 


= 0 , 1 , 


( 11 ) 


where oom = (cc’i:(m-i), j, i^(m+i):M), j = 0,1. Note that 7rT-(c(;m|<^-m) is simply a Bernonlli pmf 
and so it is easily generated. For each binary vector u), g{uj) is compnted nsing the one-step 
jackknife estimator described in Section]^ Therefore, the probability mass fnnction 
is well dehned for any r > 0. On the other hand, we have shown that the Gibbs sampling 
can be nsed to generate a Markov chain from 7rT-(Li;), from which we can hnd a consistent 
estimator u)* for the mode uj* if lo* is nniqne. Consequently, the universally maximum 
composite log-likelihood estimator of 6 can be approximated by the McLE, 

Note that the mode lo* is not necessarily unique. In this case one can still consider 
consistent estimation for u:* but in the following meaning. We know is the minimum 

and always unique according to its dehnition. Thus can be consistently and uniquely 

estimated based on the Markov chain of ^(n;) induced from the Markov chain of n; generated 
from 'n'T-{u:) when the length of the Markov chain goes to inhnity. Therefore, any estimator 
tb* such that g{tb*) becomes consistent with g{<jJ*) can be regarded as a consistent estimator 
of uj*. Consequently, the McLE 0{uj*) for each such uj* is still the universally McLE but not 
necessarily unique. From a practitioner’s view point there is no need to identify all consistent 
estimators of uj* and all universally McLE of 9. Finding out one such uj* or a tight superset 
of it would be a sufficient advance in parsimonious and efficient likelihood composition. 


3.2 Algorithm 1: MCMC Composite Likelihood Selection (MCMC-CLS) 

The above discussion motivates the steps of our core Gibbs sampling algorithm for simulta¬ 
neous composite likelihood estimation and selection. Let r be given and hxed. 


0. For f = 0, choose an initial binary vector of weights and compute the one-step 
jackknife estimator g{uj^'^'>). E.g. randomly set 5 elements of to 1 and the rest to 

0 . 
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1. For each t = I,-- - ,T for a given T, obtain by repeating 1.1 to 1.4 for each 
m = 1, • • • , M sequentially. 

1.1. Compute, if not available yet, j = 0,1. 

1.2. Compute the conditional pmf of Um given '^(m+i)-M)- 

^ exp{-T^J, } (12) 

where j = 0,1. Note this is a Bernoulli pmf. 

1.3. Generate a random number from the Bernoulli pmf obtained in 1.2, and denote 
the result as Um- 

1.4. Set Ljh) ^ compute and record 

2. Compute lj* = argmini<i<'r and regard it as the estimate of uj*. Alternatively, 

column-combine a;C)^... generated in Step 1 into an M x T matrix W; then 

compute row averages of W, say Wi, • • ■ , Um, and set • • • , a;)b) where = 1, 

if Urn > and 0^^ = 0 if Um < where ^ is some constant larger than 0.5. 

3. Finally, compute 0{u)*) (or 6{u}*)) and g{u)*) (or g{u)*)). 

Firstly, note that Gibbs sampling has been used in various contexts in the literature of model 
selection. George and McCulloch (1993) used a similar strategy to generate the distribution 
of the variable indicators in Bayesian linear regression. Qian (1999) used Gibbs sampler for 
selecting robust linear regression models. Qian and Field (2002) used the Gibbs sampler 
for selecting generalized linear regression models. Brooks, Friel and King (2003) and Qian 
and Zhao (2007) used Gibbs sampling for selection in the context of time series models. 
However, to our knowledge this is the hrst work proposing a general-purpose Gibbs sampler 
for construction of composite likelihoods. 

Secondly, the sequence is a Markov chain, which requires an initial vector 

and a burn-in period to be in equilibrium. Values of do not affect the eventual 
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attainment of equilibrium so can be arbitrarily chosen. From a computational point of view 
most components of should be set to 0 to reduce computing load. For example, we can 
randomly set all but 5 of the components to 0. To assess whether the chain has reached 
equilibrium, we suggest the control-chart method discussed in Qian and Zhao (2007). For 
the random variable ^(n;), we have the following probability inequality for any given b > 1: 



- min^(a;) > b^/Var[g{uj)] + {E[g{uj)] 


— min 




This inequality can be used to find an upper control limit for g{Lo). For example, by set¬ 
ting b = a/TO, an at least 90% upper control limit for ^(n;) can be estimated as g* + 
y/lOs^ -|- 10(^ — g*y, where g*, g and are the minimum, sample mean and sample vari¬ 
ance based on the first N observations, g{u:^^^),... N <T, where typically we set 

N = [r/2j. We then count the number of observations passing the upper control limit in the 
remaining sample If more than 10% of the points are above the con¬ 

trol limit, then at a significance level not more than 90% there is statistical evidence against 
equilibrium. Upper control limits of different levels for g{Lo) can be similarly calculated and 
interpreted by choosing different values of b. 

Thirdly, tb* computed in Step 2 is simply a sample mode of 71t{^) based on its definition in 


(10). Hence, by the Ergodic Theorem for stationary Markov chain, lj* is a strongly consistent 
estimator of lo under minimal regularity conditions. With similar arguments Um is a strongly 
consistent estimator of the success probability involved in the marginal distribution of Um 
induced from 7ir{iio). Hence, it is not difficult to see that the resultant estimator uj* should 
satisfy without requiring T to be very large. Propositions 1 and 2 in Qian and 

Zhao (2007) provide an exposition of this property. Therefore, the estimator uj* captures all 
informative sub-likelihood components with high probability. 

Finally, the tuning constant r adjusts the mixing behavior of the chain, which has impor¬ 
tant consequences on the exploration/exploitation trade-off on the search space fl. If r is too 
small, the algorithm produces solutions approaching the global optimal value uj* slowly; if r 
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is large, then the algorithm hnds local optima and may not reach u*. The former behavior 
corresponds to a rapidly mixing chain, while the latter occurs when the chain is mixing too 
slowly. In the composite likelihood selection setting, the main hurdle is the computational 
cost, so r should be set according to the available computing capacity, after running some 
graphical or numerical diagnostics (e.g., see Robert and Casella (2004)). We choose to use 
r = d in our empirical study, which does not seem to create adverse effects. 


4. AN EXTENSION FOR HIGH-DIMENSIONAL DATA 
4.1 Sparsity-enforcing penalization 

Without additional modihcations. Algorithm 1 ignores the likelihood complexity, since so¬ 
lutions with many sub-likelihoods have in principle the same chance to occur as those with 
fewer components. To discourage selection of overly complex likelihoods, we augment the 


Gibbs distribution (10) as follows: 


^exp{-r^A(t^)}, (13) 

where 

dA(t*^) = +pen(n;|A), r > 0, A > 0, (14) 

is the jackknifed variance objective defined in Section]^ Z(r, A) is the normalization 
constant, and pen{(jj) is a complexity penalty enforcing sparse solutions when dim(r2) is 
large. Maximization of is interpreted as a maximum a posteriori estimation for n;, 

where the probability distribution proportional to exp{—pen(aj|A)} is regarded as a prior 
pmf over f2. In this paper, we use the penalty term of form pen(n;|A) = since 

it corresponds to well-established model-selection criteria. For example, choices A = 1, 
A = 2“^logn and A = log log n correspond to the AIG, BIG and HQG criteria, respectively 
(e.g., see Glaeskens and Hjort (2008)). Other penalties could be considered as well depending 
on the model structure and available prior information; however, these are not shown to be 
crucial based on our empirical study so will not be explored in this paper. 
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4.2 Composite Likelihood Stability Selection 

To find the optimal solution lj*, one could compute a sequence of optimal values , • • •, 
and then take There are, however, various issues in this approach: first, 

the globally optimal value oj* might not be a member of the set since the mode of 

7rT-,A(^) is not necessarily the composite likelihood solution which minimizes g{oj). Second, 
even if oj* is in such a set, determining A is typically challenging. To address the above 
issues, we employ the idea of stability selection, introduced by Meinshausen and Biihlmann 
(2010) in the context of variable selection for linear models. Given an arbitrary value for 
A, stochastic stability exploits the variability of random samples generated from 7rr,A(^) by 
the Gibbs procedure, say ,... ^ and choses all the partial likelihoods that occur in 

a large fraction of generated samples. For a given 0 < .^ < 1, we define the set of stable 
likelihoods by the vector with elements 




stable 

m 


= < 


1. it 

0, otherwise. 


(15) 


SO we regard as stable those sub-likelihoods selected more frequently and disregard sub- 
likelihood items with low selection probabilities. Following Meinshausen and Biihlmann 
(2010), we choose the tuning constant ^ using the following bound on the expected number 
of false selections, V: 

where r]x is average number of selected sub-likelihood components. In multiple testing, 
the quantity a = E(y)/M is sometimes called the per-comparison error rate (PGER). By 
increasing only few likelihood components are selected, so that we reduce the expected 
number of falsely selected variables. We choose the threshold ^ by fixing the PGER at some 
desired value (e.g., a = 0.10), and then choose ^ corresponding to the desired error rate. The 
unknown quantity r]x in our setting can be estimated by the average number of sub-likelihood 
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components over T Gibbs samples. 


Finally, note that tuning ^ according to (16) makes redundant the determination of the 


optimal A value as long as pen{u}\\) in (14) is not dominant over This is further 

supported by our empirical study where we found the effect of A on is negligible. 

4.3 Algorithm 2: MCMC Composite Likelihood with Stability Selection (MCMC-CLS2) 
The preceding discussions lead to Algorithm 2 which is essentially the same as Algorithm 1 
with two exceptions: (i) we replace g in Algorithm 1 by the augmented objective function 


gx defined in (14); (ii) Steps 2 in Algorithm 1 is replaced by the following stability selection 
step. 

2'. Column-combine • • ■ , generated in Step 1 into an MxT matrix W. Compute 

the row averages of W, denoted as (chi, ■ ■ • he. Um = = 

1, • • • , M. Then set = ((h*, • • • ,a)jb) where = 1 if Um > Ia and = 0 

if uim < Ia, = 1, • ■ ■ ) where |a = - f + 1 ) , and a is a nominal level for 

2 \aM^ J 

per-comparison error rate (e.g., 0.05 or 0.1). 


The estimated threshold is obtained from (16) by plugging-in gx = the 

sample average of the numbers of selected sub-likelihood components in T Gibbs samples. 


5. NUMERICAL EXAMPLES 
5.1 Normal Variables with Common Location 

Let X ~ Vrf(/il,S), where the parameter of interest is the common location /i. We 
study the scenario where many components bring redundant information on /i by considering 
covariance matrix S with elements {S}mm = for all 1 < m < d, and off-diagonal elements 
{'S}im = p > 0 if I, m < d*, for some d* < d, while {S}/m = 0 elsewhere. 

We consider one-wise score composite likelihood estimator solving 0 = J2m=i ^rnUm{p), 
where Um{p) = ~ lA- ^ ^ ^asy to find the composite likelihood estimator is 

= Yfm=i^rnXmlYfra=i^rn where ^his simple model, the 

jackknife criterion g{-) can be easily computed in closed form. The pseudo-values are 
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‘V) = El. and the average of pseudo-values is 'jld- It can be 

shown that 


^(a;) = log^ ( -X^) j -21og(^ 




z=l \m=l / \m=l 

up to a constant not depending on n;. It can also be shown that O^T’-criterion has the 
following expression 


go{u:) = logVar (/id(n;)) oclog j ^ -h 2p ^ uiUm j - 2 log j ^ 

l<m<d* 


UJr, 


(17) 


\m=l 


\m=l 


depending on the unknown parameter p. This suggests that including too many correlated 
(redundant) components (with p 7 ^ 0) damages McLE’s variance as d increases. Partic¬ 
ularly, setting Uj = 1 , for all j = implies Var(Jlci{uj)) = 0(1), while choosing 

only uncorrelated sub-likelihoods {coj = 0 if 2 < j < d*, and ojj = 1 elsewhere), gives 
V ar{%i{u})) = 0{d~^). 

In Table 1, we show Monte Carlo simulation results from B = 250 runs of Algorithm 1 


(MCMC-CLSl) using the two approaches described in Section 3.2: one consists of choosing 
the best weights vector (CLSl min); the other uses thresholding, i.e. selects elements when 
the weights are selected with a sufficiently large frequency (CLSl thres.). In the same table 
we also report results on the Algorithm 2 (MCMC-CLS2) based on the stochastic stability 
selection approach. Our algorithms are compared with the estimator including all one-wise 
sub-likelihood components (No selection) and the optimal maximum likelihood estimator 
(MLE). We compute the MLE of p in two ways: either based on using the known S value or 
based on using the sample covariance S = (n — 1)“^ Sr=i(^* ~ X){Xi — X)'^. Note that the 
latter is not available when d > n. Note our algorithms are designed to obtain simultaneous 
estimation and dimension reduction in the presence of limited information (i.e. d > n or 
d^ n) where the optimal weights are difficult to estimate from the data. Stochastic selection 
with ^ = 0.7 and T = lOd was carried out for samples of n = 5,25 and 100 observations 
from a model with d* = 0.8 x d correlated components, with d = 10,30. To compare the 
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methods we computed Monte Carlo estimates of the variance (Var) and squared bias {Bias‘d) 
of Table 2 further reports the average number of selected likelihood components (no. 

comp). 


For all considered data dimensions, our selection methods outperform the all one-wise 
likelihood (AOW or No selection) estimator and show relatively small losses in terms of 
mean squared error (Var + Bias^) compared to the MLEs. The gains in terms of variance 
reduction are particularly evident for larger data dimensions (e.g., see d = 30). At the same 
time, our method tends to select mostly uncorrelated sub-likelihoods, while discarding the 
redundant components which do not contribute useful information on fi. 

Next, we illustrate the selection procedure based on MCMC-CLS2 (Algorithm 2). We 
draw a random sample of n = 50 observations from the model A'"rf(/il, S(p)) described above 
with d = 250, d* = 0.8d, and p = 0.9. This corresponds to 200 strongly redundant variables 
and 50 independent variables. We applied Algorithm 2 with a = 0.1 and A = 1 (correspond¬ 
ing to the AIC penalty). In Figure 1 (b), we show the relative frequencies of the components, 
Ef=i Um, m = 1,... ,250, in T = 1000 MCMC iterations. As expected, the uncorre¬ 
lated sub-likelihood components (components 201-250) are sampled much more frequently 
than the redundant ones (components 1-200). Figure 1 (c) shows the objective function 
(up to a constant) evaluated at the best solution computed from past MCMC 
samples (.^ = 0.7). Figure 1 (d) shows the Hamming distance, dist(u},u}') = 7^ ^'j)) 

between the current selected rule, uj* and true optimal value, uj* = (1, 0 ,..., 0,1,..., 1). 


199 50 

Overall, our algorithm quickly approaches the optimal solution and the hnal selection has 
94.0% asymptotic relative efficiency (ARE) compared to MLE. When no stochastic selection 
is applied and all 250 sub-likelihood components are included, the relative efficiency is only 
13%. As far as compnting time is concerned, our non-optimized R implementations of the 
MCMC-CLSl and MCMC-CLS2 algorithms for the above example takes, respectively 4.25 
and 1.87 seconds per MCMC iteration, respectively. The compnting time was measured on 
a laptop compnter with Intel ©Core™ i7-2620M CPU 2.70GHz. 
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No selection CLSl (min) CLSl (thresh.) CLS2 MLE (known S) MLE (unknown S) 

^ar Bias‘d Var Bias‘d Var Bias‘d Var Bias‘d V ar Bias‘d Var Bias‘d 
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with stability selection (CLS2) with a = 0.1 and A = 1; Maximum likelihood estimator based on where S is known; 
Maximum likelihood estimator based on where S can be estimated if d < n, otherwise the value is missing and denoted 
by NA. For each method we show the hnite sample variance {Var) and squared bias {Bias‘d) for n = 5,25,100, d = 10, 30 and 
p = 0.5, 0.9. Estimates based on i? = 250 Monte Carlo runs (simulation settings: r = d, T = lOd , ^ = 0.7). Monte Carlo 
standard errors are smaller than 0.001. 



n 

d 

P 

No selection 

CLSl (min) 

CLSl (thres.) 

CLS2 

5 

10 

0.50 

10 

3.77 

3.92 

3.72 



0.90 

10 

3.28 

2.58 

2.36 


30 

0.50 

30 

13.94 

11.01 

10.84 



0.90 

30 

12.04 

6.53 

6.43 

25 

10 

0.50 

10 

4.30 

3.58 

3.26 



0.90 

10 

3.27 

2.04 

2.00 


30 

0.50 

30 

12.81 

7.70 

7.48 



0.90 

30 

11.96 

5.98 

5.98 

100 

10 

0.50 

10 

4.28 

2.56 

2.31 



0.90 

10 

3.17 

2.00 

2.00 


30 

0.50 

30 

12.22 

6.08 

6.05 



0.90 

30 

11.94 

6.00 

6.00 


Table 2: Number of selected sub-likelihoods by different methods. MCMC-CLSl algo¬ 
rithm with and without thresholding (CLSl (min) and MCMC-CLSl (thresh.), respectively); 
MCMC-CLS2 algorithm with stability selection (CLS2) with a = 0.1 and A = 1. For each 
method we show results for n = 5, 25,100, d = 10, 30 and p = 0.5, 0.9. Estimates based on 
B = 250 Monte Carlo runs (simulation settings: t = d, T = lOd , ^ = 0.7). Monte Carlo 
standard errors are smaller than 0.01. 


5.2 Exchangeable Normal Variables with Unknown Correlation 

Let X ~ iVrf(0, S(p)), where S(p) = {(1 — p)Id + plrflj} and 0 < p < 1 is the unknown 
parameter of interest. The marginal univariate sub-likelihoods do not contain information 
on p, so we consider pairwise sub-likelihoods 


71 

Mp) = -^iog(i-P^)- 


{SSu + SSmm) , pSSlm 


o/'i 


+ 




1 < / < 




(18) 


where SSmm = SSim = ■ Given a;, we estimate p by solving 

the composite score equation 0 = Ylij<k^jkUjk{p) by Newton-Raphson iterations where each 
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pairwise score 

Ujk{p) = (1 + p‘^)SSjk — p{SSjj + SSkk) + np{l — p^) (19) 

is a cubic function of p. It is well known that the composite likelihood estimation can 
lead to poor results for this model. Cox and Reid (2004) carries out asymptotic variance 
calculations, showing that efficiency losses compared to MLE occur whenever d > 2, with 
more pronounced efficiency losses for p near 0.5. Particularly, if d = 2 (exactly one pair), 
ARE = 1; if d > 2, ARE=1 if p approaches 0 or 1. The next simulation results show that 
in hnite samples composite likelihood selection is advantageous even in such a challenging 
situation. 

Since closed-form pairwise score expressions are available for this model, we use the 
objective function ^(n;) based on the one-step jackknife with pseudo-values computed as 
in Equation ([^. In Table 3, we present results from B = 250 Monte Carlo runs of the 
MCMC-CLS algorithm applied to the model with d = 5, 8,10 dimensions, which correspond 
to M = ( 2 ) = 10,28,45 sub-likelihoods, respectively. We compute Monte Carlo estimates 
for the hnite-sample relative efficiency RE = Var{papw)/Var{pci{u}*)), where Var denotes 
the Monte Carlo estimate of the finite-sample variance pci{dj*) is the estimator selected 
by the MCMC-CLS algorithm, while papw is the all pairwise (APW) estimator obtained by 
including all available pairs (thus RE > 1 indicates that our stochastic selection outperforms 
no selection). We show values of p around 0.5, since they correspond to the largest asymptotic 
efficiency losses of pairwise likelihood compared to MLE (see Cox and Reid (2004), Figure 
1). In all considered cases, our stochastic selection method improves the efficiency of the 
estimator based on all pairwise components; at the same time, our composite likelihoods 
employ a much smaller number of components. For example, when p = 0.6 the efficiency 
improvements range from 8% to 39%, using only about half of the available components. 

5.3 Real Data Analysis: Australian Breast Cancer Family Study 

In this section, we apply the MCMC-CLS algorithm to a real genetic dataset of women with 
breast cancer obtained from the Australian Breast Cancer Family Study (ABCFS) (Dite, 
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n = 10 



n = 50 



II 

II 

10 

28 

45 

10 

28 

45 

II 

o 

RE 

1.27(0.03) 

1.11(0.01) 

1.07(0.01) 

1.15(0.01) 

1.12(0.01) 

1.06(0.01) 


No. comp. 

5.83(0.09) 

13.70(0.16) 

21.12(0.22) 

7.30(0.08) 

13.82(0.16) 

21.45(0.22) 

p = 0.5 

RE 

1.36(0.02) 

1.14(0.01) 

1.06(0.01) 

1.19(0.01) 

1.11(0.01) 

1.06(0.01) 


No. comp. 

5.51(0.08) 

13.32(0.07) 

20.96(0.22) 

7.01(0.07) 

13.64(0.18) 

21.23(0.22) 

o 

II 

RE 

1.39(0.02) 

1.17(0.01) 

1.10(0.01) 

1.38(0.02) 

1.14(0.01) 

1.08(0.01) 


No. comp. 

5.80(0.08) 

12.60(0.16) 

20.99(0.20) 

5.39(0.08) 

13.27(0.15) 

21.45(0.21) 


Table 3: Pairwise likelihood selection for Model 2, Nd{0, S(p)) based on Algorithm 1: Monte 
Carlo estimates for; (i) the relative efficiency of the parameter estimate under selection versus 
no selection ( RE = Var{papw)/Var{pci{u>*)), so that RE > 1 indicates that the selection 
outperforms no selection); (ii) and number of sub-likelihood components (No. comp.). Monte 
Carlo standard errors are in parenthesis. Simulation settings: r = d, chain length T = lOd, 
^ = 0.7. 

Jenkins, Southey, Hocking, Giles, McCredie, Venter and Hopper 2003) and control subjects 
from the Australian Mammographic Density Twins and Sisters Study (Odefrey, Stone, Gur- 
rin, Byrnes, Apiceha, Dite, Cawson, Giles, Treloar, English et al. 2010). Ah women were 
genotyped using a Human610-Quad beadchip array. The hnal data set that we used consisted 
of a subset of 20 single nucleotide polymorphisms (SNPs) corresponding to genes encoding a 
candidate susceptibility pathway, which is motivated by biological considerations. After rec¬ 
ommended data cleaning and quality control procedures (e.g., checks for SNP missingness, 
duplicate relatedness, population outliers (Weale 2010)), the hnal data consisted of n = 333 
observations (67 cases and 266 controls). 

To detect group effects due to cancer, we consider an extension of the latent multivariate 
Gaussian model hrst introduced by Han and Pan (2012). Let = (Y’/*^..., vj*^), i = 
1 ,... ,n, be independent observations of a multivariate categorical variable measured on n 
subjects. Each variable can take values 0,1 or 2, representing the copy number of one of 


21 



the alleles of SNP k of subject i. The binary variable = 0 or 1 represents disease 

status of the ith subject (0 = control and 1 = disease). We assume a latent random d-vector 
— (^Z^\... ^ ~ R), where is a conditional mean vector with 

elements {6) = ■■■ = [6) = and R is the correlation matrix. Our main interest is 

on the unknown mean parameter 6, which is common to all the SNP variables and represents 
the main effect due to disease. We assume < 7 ^ 1 ), 

P(Ff = 1|X(*) = a;«) = P(7fci < 4*^ < 7 ^ 2 ), and P(F® = 2|X« = = P(Z« > 7 , 2 ), 

where 7^1 and 7^2 are SNP-specihc thresholds. The above model reflects the ordinal nature of 
genotypes and assumes absence of the Hardy-Weinberg equilibrium (HWE) (allele frequencies 
and genotypes in a population are constant from generation to generation). If the HWE 
holds the parameters 'yik and 72 *; are not needed, since we have the additional constraint 
P(yW = 2) = P(y« = 1)2. 

Let 7 = {(7ifc,72fc) : k = 1,... ,d} and dehne intervals TkiY^"^) to be (-00,7^1], (7^1,7^2] 
and [7^2, cxd), corresponding to = 0,1 and 2 , respectively. The full log-likelihood function 
is 


i{e, j,r) = J2 log 

i=l 


= 5^ log 


'rifoW) 



... ,Zd\^i^"\0),R)dzi ■ --dzd, 


where f{zi,... ,Zd\i-i,R) is the pdf of the d-variate normal density function with mean /j, 
and correlation matrix R. Clearly, the full log-likelihood function is intractable when d is 
moderate or large, due to the multivariate integral in the likelihood expression. Note that 
for the marginal latent components, we have ~ Ni{6x^^\ 1 ), so 7 and 6 can be estimated 
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by minimizing the one-wise composite log-likelihood 


d n 

uo, = p{yf = sth’f=i“') (20) 

k=l i=l 
d n p 

= / . 4>{zk\0x^"\l)dzk, ( 21 ) 

k=l i=l 

where 1) denotes the normal pdf with mean /i and unit variance. We focus on using 

the one-wise composite log-likelihood in this section, except when R is to be estimated where 
we use pairwise composite log-likelihood. Differently from the expression in Han and Pan 
(2012), the disease group effect 9 is common to multiple sub-likelihood components; also, 
we allow for the inclusion/exclusion of particular sub-likelihood components (corresponding 
to SNPs) by selecting uj. We estimated the optimal composition rule uj* based on Gibbs 
samples from Algorithm 1, where the objective function g{u:) = \ogVar{6{u})) was estimated 
by delete-10 jackknife. We selected hve marginal likelihoods (SNPs) occurring with at least 
^ = 0.7 frequency in 250 runs of the Gibbs sampler (see Figure 2 (a)). In Figure 2 (b), we 
show the trajectory of the objective function g{u:) evaluated at the current optimal solution 
(optimal solutions are computed from past samples using a ^ = 0.7 threshold). The 
estimated variance tends to sway toward the minimum as more composition rules are drawn 
by our Gibbs sampler. This behavior is in agreement with preliminary simulation results 
carried out on this model (not presented here) as well as the examples presented in Sections 
OandIO 

Figure 2 (c) shows the empirical distribution of parameter estimates, 9^^^ = 9{(jJ^^^), 
based on sampled vectors t = 1,...,250. The vertical dashed line corresponds to 
the selected parameter estimate 9{Lb*), which is located near the mode of the empirical 
distribution. Particularly, the selected McLE is 9{lj*) = —0.125 and the corresponding 
delete-10 jackknife standard error is sd{9{LJ*)) = 0.012. The McLE based on using all 20 
target SNPs is 9aii = —0.112 with the delete-10 jackknife standard error sd{9aii) = 0.042. 
Our estimator gives a substantial accuracy improvement, supporting the conclusion of a 
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difference between case and control groups (i.e., 9^0) with higher conhdence. Finally, in 
Figure 2 (d), we show estimates for the correlation matrix R for the target SNPs, based on 
the pairwise composite likelihood described in Han and Pan (2012). 

6 . FINAL REMARKS 

Composite likelihood estimation is a rapidly-growing need for a number of helds, due to 
the astonishing growth of data complexity and the limitations of traditional maximum like¬ 
lihood estimation in this context. The Gibbs sampling protocol proposed in this paper 
addresses an important unresolved issue by providing a tool to automatically select the most 
useful sub-likelihoods from a pool of feasible components. Our numerical results on simu¬ 
lated and real data show that the composition rules generated by our MCMC approach are 
useful to improve the variance of traditional McLE estimators, typically obtained by using 
all sub-likelihood components available. Another advantage deriving from our method is 
the possibility to generate sparse composition rules, since our Gibbs sampler selects only a 
(relatively small) subset of informative sub-likelihoods while discarding non-informative or 
redundant components. 

In the present paper, likelihood sparsity derives naturally from the discrete nature of 
our MGMG approach based on binary composition rules with uj G {0,1}^. On the other 
hand, the development of sparsity-enforcing likelihood selection methods suited to real¬ 
valued weights would be valuable as well and could result in more efficient composition 
rules. For example, Lindsay et al. (2011) discuss the optimality of composite likelihood es¬ 
timating functions involving both positive and negative real-valued weights. Actually, the 
row averages cJi, • • • ,cJm computed in Step 2 of Algorithm 1 can also be used to replace 
n; = (cji,--- tOJmY dehned by ([^, providing a composite log-likelihood with 

between-O-and-1 weights. It would be of interest to see how well this form of composite log- 
likelihood gets on in regard to the efficiency. This, however, was not pursued in this paper 
and is left for future exploration. Finally, the penalized version of the objective function 
described in Section enforces sparse likelihood functions, which is necessary in situations 
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where the model complexity is relatively large compared to the sample size. Thus devel¬ 
oping a thorough theoretical understanding on the effect of the penalty on the selection as 
d, n —)■ cx) would be very valuable for improved selection algorithm in high-dimensions. 
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(b) 





Figure 1: Stochastic selection for Model 1, iVd(/xl, S), based on Algorithm 2. (a) Objective 
function gx{u:) evaluated at samples drawn from tt,- the horizontal solid line is the con¬ 
trol chart limit as described in Section 3.2 (b) Observed frequency of sampled sub-likelihood 
components, (c) Estimated objective function evaluated at the progressively selected likeli¬ 


hood, based on past samples, (d) Hamming distance between the progressively se¬ 


lected likelihood, and the globally optimal solution, n;*. Simulation settings: n = 50, 

d = 250, d* = 200 t = d, T = 1000 , burn-in length = 250, a = 0.1, A = 1. 
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Figure 2 : Composite likelihood selection for the ABCFS data by Algorithm 1 : (a) Fre¬ 
quency of the sampled marginal likelihood components; (b) Objective function evaluated 
at the current solution uj* (computed from past samples with ^ = 0.7); (c) Parameter es¬ 
timates, 0 d) = based on sampled composition rules, with optimal parameter 

estimate 0{u}*) corresponding to vertical dashed line, (d) Pairwise likelihood estimates for 
the correlation matrix R for SNPs in the susceptibility pathway. 
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